How Similar Are Drug Data and Disease Self-report? Estimating the Prevalence of Chronic Diseases in Less Developed Settings

Background: Drug data has been used to estimate the prevalence of chronic diseases. Disease registries and annual surveys are lacking, especially in less-developed regions. At the same time, insurance drug data and self-reports of medications are easily accessible and inexpensive. We aimed to investigate the similarity of prevalence estimation between self-report data of some chronic diseases and drug data in a less-developed setting in southwestern Iran. Methods: Baseline data from the Pars Cohort Study (PCS) was re-analyzed. The use of disease-related drugs was compared against self-report of each disease (hypertension [HTN], diabetes mellitus [DM], heart disease, stroke, chronic obstructive pulmonary disease [COPD], sleep disorder, anxiety, depression, gastroesophageal reflux disease [GERD], irritable bowel syndrome [IBS], and functional constipation [FC]). We used sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and the Jaccard similarity index. Results: The top five similarities were observed in DM (54%), HTN (53%), heart disease (32%), COPD (30%), and GERD (15%). The similarity between drug use and self-report was found to be low in IBS (2%), stroke (5%), depression (9%), sleep disorders (10%), and anxiety disorders (11%). Conclusion: Self-reports of diseases and the drug data show a different picture of most diseases’ prevalence in our setting. It seems that drug data alone cannot estimate the prevalence of diseases in settings similar to ours. We recommend using drug data in combination with self-report data for epidemiological investigation in less-developed settings.


Introduction
Different types of data have been used to estimate the prevalence of diseases. Self-report data, clinical examination, and paraclinical data are commonly used for this purpose. Like any other subjective data, self-report data is prone to inaccurate reporting due to recall bias, social desirability bias, 1 and cognitive deficit. Low cost and practicality are advantages of this method. 2 Clinical examination and paraclinical data are used for detection of diseases. These types of data are costly but more accurate. Although self-reporting serves as a crucial source of information for physicians in clinical settings, patients' medical history frequently does not align with their medication history. This challenges physicians to assess the reliability of their patients' claims about having certain diseases. This clinical problem can have broader implications in epidemiology, particularly in studies that rely on self-reported disease data, such as prevalence surveys.
Drug data has also been used to estimate the prevalence of diseases. 3,4 It is objective, accessible, inexpensive, and mostly registered in insurance databases or health records. Health surveillance systems, disease registries, insurance databases, and regular national health surveys are available in more developed regions, while in less-developed areas, sustainable and integrated surveillance is lacking. 5,6 Thus, we hypothesized that drug data is suitable for estimating disease prevalence in less-developed settings due to its feasibility and low cost.
Only a few studies have compared medication data with other sources. A study by Chini et al showed that drug data could be used to provide reliable prevalence estimates of several chronic diseases. This study was conducted in Lazio, a region in central Italy. Drug data came from one registry, which collects data on medications prescribed by general physicians or outpatient centers (not hospitals). Disease data were gathered from two registries that collect health records provided by health care units and from a population-based survey that gathered self-reports of diseases. 3 In another study, Hafferty et al compared self-reported use of certain drugs with national prescription data. They found that self-reported drug use was accurate compared to prescription data. This study used self-reported drug data from a cohort of Scottish adults. The prescribed drug data came from health records in the National Health Information Registry. 11 However, we did not find a study that compared medication data with self-reported data.
This study investigates the similarity of medication use to patients' self-report of diseases. Here, we try to answer whether drug data estimates disease prevalence similarly to self-report data. We investigated the similarity between self-reporting of chronic diseases and drug data, using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and the Jaccard similarity index.

Study Design
This is a cross-sectional study, and the data was obtained from the Pars Cohort Study (PCS). PCS started in 2014 and is still ongoing in Valashahr, Fars province.

Setting
Valashahr is a county of the Fars province in southern Iran. It is a semi-urban area with 40 000 inhabitants, mostly of Turkish and Persian ethnicities. Primary health centers and general practitioners in the private sector provide health services in this area. Specialized and sub-specialized services are available in Fars' surrounding cities and capital (Shiraz). More information about PCS can be found elsewhere. 12,13

Data Collection
Demographic Data
Using a standardized questionnaire, the PCS study gathered sex, age, ethnicity, education, and socioeconomic status (SES). 12 Age was categorized into three groups: under 50, between 50 and 60, and over 60 years; ethnicity was tagged as Persian and non-Persian; education was classified as illiterate, less than diploma, and university. SES was calculated based on a latent variable measured by analysis of the participants' assets using multiple correspondence analysis (MCA), and participants were categorized into quartiles of this variable (low, low-middle, middle-high, and high).

Disease Data
A total of 9264 individuals aged 40-75 years participated in PCS. They were interviewed face to face by a local physician and a trained nurse. In this interview, the participants were asked, "Has your physician told you that you have the disease X and need treatment?". If they answered "Yes", they were categorized as having disease X.
History of hypertension (HTN), diabetes mellitus (DM), heart disease, stroke, chronic obstructive pulmonary disease (COPD), sleep disorder, anxiety, depression, gastroesophageal reflux disease (GERD), irritable bowel syndrome (IBS), and functional constipation (FC) was obtained from the participants. To diagnose IBS, GERD, and FC, patients were asked about their symptoms, and ROME IV criteria were applied.

Drug Data
Participants were asked to bring all their drugs with them to the interview. A trained nurse recorded the medications they had been taking for at least the past three months.

Drug Classification
We used the first and second levels of the Anatomical Therapeutic Chemical (ATC) classification system to classify drugs. We merged drug classes with lower prevalence in the first level of ATC (e.g., Aot).

Test Methods
Drugs used in each disease were listed based on UpToDate, and an expert panel consisting of two physicians and pharmacists reviewed the list to determine whether the drugs were included correctly. The use of drugs was defined as a binary variable for each disease and treated as the test in the analysis. Each disease's self-report was also considered a binary variable and served as the reference standard (Table 1). In this study, we used self-report data as a prevalent existing standard, despite its imperfections. 15,16 However, if drug data is preferred as the reference standard, the indices can be converted by referring to the description provided in Table 1. For those disorders for which self-report was not available in the PCS, only the estimated prevalence was derived from their drug use pattern (more details can be found in Table S1).
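As an illustration of how a per-disease binary drug-use variable could be derived, the following Python sketch flags a participant as a drug user for a disease when any recorded medication falls in a drug class assigned to that disease. The mapping `DISEASE_DRUG_CLASSES`, the function names, and the example codes are assumptions made for illustration; the actual drug list reviewed by the expert panel is not reproduced here.

```python
# Hypothetical mapping from disease to second-level ATC classes.
# This is NOT the study's drug list; it only illustrates the idea.
DISEASE_DRUG_CLASSES = {
    "DM": {"A10"},    # A10: drugs used in diabetes
    "COPD": {"R03"},  # R03: drugs for obstructive airway diseases
}


def atc_level2(code):
    """Return the second-level ATC class (one letter plus two digits)."""
    return code.strip().upper()[:3]


def drug_use_flags(participant_atc_codes, mapping=DISEASE_DRUG_CLASSES):
    """Return {disease: 0 or 1}; 1 if any recorded drug belongs to a mapped class."""
    classes = {atc_level2(code) for code in participant_atc_codes}
    return {
        disease: int(bool(classes & targets))
        for disease, targets in mapping.items()
    }


# Example participant taking metformin (A10BA02) and atorvastatin (C10AA05).
flags = drug_use_flags(["A10BA02", "C10AA05"])
print(flags)
```

Each flag can then be compared against the self-report of the same disease as test versus reference standard.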

Statistical Methods
Frequency (%) was used to describe the population and the pattern of drug use and diseases. We estimated the sensitivity, specificity, PPV, and NPV with 95% confidence intervals (CIs). We also used the Jaccard index to examine the similarities between disease self-report and drug use. The Jaccard index is the ratio of cases that are positive in both methods to cases that are positive in either method. 17 Therefore, when the test and the reference standard report identically, the Jaccard index equals 100%, indicating the highest similarity. 17 The majority of the population is neither diseased nor takes any medication (true negatives, TN). We chose the Jaccard similarity index to avoid the effect of the large number of TNs.
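The indices described above can be sketched in Python as follows. This is a minimal illustration, not the analysis code used in the study: the function names and the toy data are assumptions, drug use is treated as the test and self-report as the reference standard, and confidence intervals are omitted.

```python
def confusion_counts(drug_use, self_report):
    """Count TP, FP, FN, TN with drug use as test and self-report as reference."""
    tp = fp = fn = tn = 0
    for test, ref in zip(drug_use, self_report):
        if test and ref:
            tp += 1
        elif test and not ref:
            fp += 1
        elif ref:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def agreement_metrics(tp, fp, fn, tn):
    """Return the five indices; an undefined ratio (zero denominator) is NaN."""
    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        # Jaccard: positive in both methods / positive in either method.
        # TN does not appear, so a large TN count cannot inflate it.
        "jaccard": ratio(tp, tp + fp + fn),
    }


# Toy data for ten hypothetical participants (1 = positive, 0 = negative).
drug_use = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
self_report = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

counts = confusion_counts(drug_use, self_report)
metrics = agreement_metrics(*counts)
print(counts, metrics)
```

In this toy example, specificity and NPV are high mainly because most participants are negative on both measures, while the Jaccard index reflects only the participants flagged by at least one method.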

Discussion
In this study, we aimed to investigate the similarity of self-report data of diseases and drug data. We found that in most cases, drug data show low similarity to the self-report data. In our setting, drug data does not seem to be a suitable tool for estimating the prevalence of IBS, GERD, FC, stroke, COPD, sleep disorders, diabetes, anxiety, heart disease, and depression. For IBS, GERD, and FC, patients were asked about their symptoms and were diagnosed based on ROME IV criteria. So, our hypothesis that drug data are suitable for estimating the prevalence of diseases in less-developed settings seems to be rejected.
DM was under-reported by drug data, possibly because of the low medication adherence rate, which leads to less drug use in diseased individuals. 18 Low disease awareness, perception of illness, and health-seeking behavior were also reported as factors contributing to the low rate of self-report of diseases. 19,20 Moreover, a study based on pharmacy claim data in Iran reported that the DM prevalence rate was 6.4%, similar to our results. 4 Drug data also under-reported COPD. Complex treatment regimens contribute to low medication adherence in COPD patients. 21,22 Previous studies also indicated moderate health literacy in Iranian COPD patients, 23 and health literacy correlated with medication adherence. 24 In both DM and COPD, PPV was above 90%, demonstrating that positive drug use data is a good predictor of self-report. A similar study indicated that the prevalence of COPD drug use was 5.2%. 4 The difference from our estimate may be due to population variation, as that study used data from both urban and rural areas.
Drug data under-report IBS and FC prevalence. In both IBS and FC, the similarity index, sensitivity, and PPV are low. Both these diseases were diagnosed based on ROME IV criteria through the interview; therefore, many patients had not been diagnosed before the session and were not treated. Treatment of these diseases consists of lifestyle and dietary modification, physical activity, and pharmacotherapy. 25 Accordingly, not all patients receive medication, 26,27 leading to a high number of false negatives (FN). Over-the-counter (OTC) access to these drugs in patients with other differential diagnoses is probably the cause of drug use in ROME IV-negative individuals, contributing to a high number of false positives (FP).
Drug data under-report the prevalence of sleep disorders, anxiety, and depression. Similarity index, sensitivity, and PPV are low, which implies high FN and FP. The high number of patients with these diseases who do not use medication may be explained by undertreatment, 28,29 low medication adherence, 30,31 and the use of non-pharmacological treatments such as cognitive behavioral therapy. Drugs used in treating these diseases overlap with each other and with those used for other mental disorders. Social stigma for mental disorders may also affect self-report of these diseases. 32
The prevalence estimates from self-report and drug data are similar in HTN. Sensitivity, PPV, and similarity index are low, indicating that the individuals identified by each method are different. Drugs used in HTN are commonly used in other diseases like heart disease and stroke. Also, many hypertensive patients are undiagnosed, insufficiently treated, or have low compliance. 19 Therefore, the mentioned indices are low.
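The HTN observation, similar prevalence estimates but different individuals, can be illustrated with a short Python sketch. The participant IDs and counts below are entirely hypothetical and are not taken from the PCS data; the function names are also assumptions.

```python
def prevalence(positive_ids, population_size):
    """Share of the population flagged as positive by one method."""
    return len(positive_ids) / population_size


def jaccard(ids_a, ids_b):
    """Positive in both methods divided by positive in either method."""
    union = ids_a | ids_b
    return len(ids_a & ids_b) / len(union) if union else float("nan")


# Hypothetical population of 1000 participants, identified by integer IDs.
population_size = 1000
self_reported = set(range(0, 100))   # 100 participants report the disease
drug_users = set(range(50, 150))     # 100 participants use related drugs

print(prevalence(self_reported, population_size))
print(prevalence(drug_users, population_size))
print(jaccard(self_reported, drug_users))
```

Both methods give the same prevalence in this constructed example, yet only half of each group appears in the other, so agreement at the population level does not imply agreement at the individual level.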
Drug data over-report heart disease, stroke, and GERD. Drugs used to treat heart disease, stroke, and HTN overlap with each other. In another study, these three diseases were reported as a pooled group of cardiovascular diseases. 4 It seems that drug data cannot provide the prevalence of the overlapping diseases separately. GERD symptoms are common in the population and overlap with other differential diagnoses, and GERD drugs are available OTC. Patients with minimal symptoms self-medicate, 33 while ROME IV criteria are not met. DM and COPD are the two diseases that have no overlap with the other diseases assessed here in terms of medical therapy.
The pharmacotherapy rate among COPD patients varies significantly between genders; the male pharmacotherapy rate was higher than the female, probably due to the higher prevalence, severity, and diagnosis of COPD in men, as reported. 34 Pharmacotherapy decreased with aging in men except in COPD and stroke. In women, aging increased the pharmacotherapy rate in most diseases except for IBS and mental disorders. So, these results show that the pharmacotherapy rate varies across demographic groups and diseases.
Controversy exists concerning self-report validity in the literature. 7,8 Some studies showed that self-reporting had substantial agreement with health records, while others found that it under-reported the prevalence of diseases. In one study, Smith et al compared self-reporting of diseases to health records and found that it is better to use self-report data to rule out diseases and to use more objective data for prevalence studies. 10 Self-report data is thought to be affected by recall bias, desirability bias, 1 cognitive deficit, and language barrier. Drug data, a more objective data source, seems less influenced by these shortcomings. However, drug data itself can be affected by many factors, including adherence to medications; knowledge, attitude, and practice of physicians; access to health services; and the pharmacotherapy rate of the disease. The treatment strategy of a disease influences its pharmacotherapy rate. For example, in diseases where non-pharmacological interventions such as lifestyle and dietary modifications and cognitive behavioral therapy are commonly used, drug data cannot be used to estimate prevalence. Also, different health-seeking behaviors across diverse demographic groups lead to different pharmacotherapy rates; this may lead to overestimation or underestimation in different population groups, which can affect the representativeness and generalizability of the drug data results. Using the same drug for different diseases or off-label use of drugs can also interfere with prevalence estimation and could lead to the overestimation of some diseases. Thus, combining drug data with self-report data can be a suitable alternative to using either alone.
We only compared self-report data with drug data. Using other data sources, such as health records, would provide a more accurate understanding of drug data's ability to estimate disease prevalence. So, we acknowledge that using health records as a comparator for drug data could offer advantages. However, the unavailability of such data is a shared limitation in our setting and other regions of Iran. We used large groups of drugs (first and second levels of ATC) in this study. It is assumed that using large groups of drugs can lead to overestimation. However, our results indicated an underestimation in prevalence and low sensitivity of drug data in most diseases. So, if smaller groups of drugs had been applied, the gap between drug data and self-report data would have become even wider. We also considered self-report data as the reference standard to compare these two data sources. Although self-report has several biases, we used the similarity index, which is not affected by changing the reference standard. The other indices used in our study can also be converted to each other if the reference standard changes (Table 1).
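The effect of changing the reference standard can be checked with a small Python sketch. Swapping the roles of test and reference exchanges FP and FN while TP and TN stay the same, so sensitivity and PPV trade places, specificity and NPV trade places, and the Jaccard index is unchanged. The counts and function names below are hypothetical and serve only to demonstrate the algebra.

```python
import math


def indices(tp, fp, fn, tn):
    """Agreement indices from a 2x2 table (test versus reference standard)."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "jaccard": tp / (tp + fp + fn),
    }


def swap_reference(tp, fp, fn, tn):
    """Swap test and reference: former FP become FN and vice versa."""
    return tp, fn, fp, tn


# Hypothetical counts with drug use as test and self-report as reference.
counts = (30, 10, 60, 900)  # TP, FP, FN, TN

original = indices(*counts)
swapped = indices(*swap_reference(*counts))

print(original)
print(swapped)
```

With these counts, the sensitivity under the swapped reference equals the original PPV, and the Jaccard index is the same in both directions.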

Conclusion
In conclusion, self-reports of diseases and the drug data show a different picture of most diseases' prevalence in our setting. It seems that drug data alone cannot estimate the prevalence of diseases in settings similar to our study. We recommend using drug data in combination with self-report data for epidemiological investigation in less-developed settings.

Table 1. Confusion Matrix and the Effects of Changing Standard Reference on Each Index
TP, true positive; TN, true negative; FP, false positive; FN, false negative; PPV, positive predictive value; NPV, negative predictive value.

Table 2. Demographic Characteristics and Comorbidities of Participants

Table 3. Prevalence, Pharmacotherapy Rate and Performance Metrics of Self-report Versus Drug Use Data
b Other subclasses of A group with lower use.